1. Resources

R for Data Science: https://r4ds.had.co.nz

R cheatsheets: https://rstudio.com/resources/cheatsheets/

Base R cheatsheet: https://rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf

ggplot cheatsheet: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

2. Create an R Project

https://r4ds.had.co.nz/workflow-projects.html

8.4 RStudio projects

R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via projects.

Let’s make a project for you to use while you’re working through the rest of this book. Click File > New Project, then:

A project is an entity that contains:

Create a new project every time you start a new project

3. R Studio

A quick orientation to your R studio windows:

A more detailed orientation to R Studio:

4. R script

You should create a new R Script every time you are working on a new task within your project. Think about the R script as a word document that you can write in, edit, save, then open later to work on again. To open up an R script go to New File > R Script.

A new window will open up called “Untitled1”

Type this code into your R script:

a <- 3
b <- 4

a + b

What happened when you pressed enter after a + b? Did you get this error message: “Error: object ‘a’ not found”?

Why didn’t it work? If you just pressed enter after each line, you just went to a new line of your code. In order for R to know what “a” and “b” are, you need to execute this code so that R knows that you are assigning the value “3” to the variable “a” and the value “4” to the variable “b.”

To execute your code, put your cursor back on the first line or highlight “a <- 3” and press command + enter (mac) or control + enter (windows). Do the same for “b <- 4” and “a + b.” Now what happened? You should have gotten an output in your console that “a + b” equals 7.

Look at your R environment. Before you executed your code, this window should have been empty but now after executing your code, you have two new variables, a and b. R will save these values in your environment for each working session, which means that later on in your code if you refer to “a” again, R will remember that “a = 3.” However, if you close your R script or R project, it will not save your environment unless you tell it to. In general, it’s good practice not to save any information between working sessions but rather to start fresh. There are two reasons for this. First, in your next session, you may start working in a new R script and you may want to assign “a” to a new value. Second, your code should always be 100% reproducible. For example, if you call “a + b” without first assigning “a” in this code, then you should get the error message “Error: object ‘a’ not found” to remind you that you need to assign “a” a value.

Sometimes you may want to clear your environment within a session, particularly if you’re switching from working on one R script to another. You can do so with this command:

remove(list = ls())

4. Data Munging

A huge part of my job is what I call data munging. This is the process of taking raw unclean data and cleaning, processing, and formatting it so that it is in a more useable form.

4.1 Read in your data

For this exercise, open up a new R script. At the top, make sure you read in tidyverse:

library(tidyverse)

Now let’s find some data! Your data “lives” somewhere on your computer. In order to read the data into R, we need to tell R where on your computer the data reside. Just as you would give someone directions to your address, you must also tell R the directions to your file. To find the directory to your file, you can use this command. It will pop up a window on your computer, navigate to where the file lives and press enter.

Find the file “bat_counts.csv” on your computer:

file.choose()

Then copy this directory and paste it inside the command read_csv()

bat_counts <- read_csv("insert_your_file_path_here")

Eventually you’ll want to store this file in a data folder called “data” that is in your R project folder, and then you can read it in using a relative path rather than a direct path:

bat_counts <- read_csv("data/bat_counts.csv")

But we will cover this in more detail in a follow-up R tutorial.

For now, let’s dig into these data! To take a first look, here are some options…

Option 1: Call the data

bat_counts
## # A tibble: 100 x 5
##     site  mylu  myse  myso  pesu
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1    84    19   984    37
##  2     2   114    16   963    54
##  3     3    87    22   989    64
##  4     4   102    16   981    47
##  5     5   111    13  1017    53
##  6     6    93    22  1016    32
##  7     7   134    21  1039    39
##  8     8   102    18  1014    32
##  9     9    92    11   946    54
## 10    10   100    30   995    53
## # … with 90 more rows

Option 2: See just the first 5 rows of the data

head(bat_counts)
## # A tibble: 6 x 5
##    site  mylu  myse  myso  pesu
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1    84    19   984    37
## 2     2   114    16   963    54
## 3     3    87    22   989    64
## 4     4   102    16   981    47
## 5     5   111    13  1017    53
## 6     6    93    22  1016    32

or first 10 rows

head(bat_counts, 10)
## # A tibble: 10 x 5
##     site  mylu  myse  myso  pesu
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1    84    19   984    37
##  2     2   114    16   963    54
##  3     3    87    22   989    64
##  4     4   102    16   981    47
##  5     5   111    13  1017    53
##  6     6    93    22  1016    32
##  7     7   134    21  1039    39
##  8     8   102    18  1014    32
##  9     9    92    11   946    54
## 10    10   100    30   995    53

or last 20 rows:

tail(bat_counts, 20)
## # A tibble: 20 x 5
##     site  mylu  myse  myso  pesu
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1    81   115    12   998    61
##  2    82    98    27  1018    58
##  3    83    95    20   993    37
##  4    84   109    22  1024    44
##  5    85   101    21  1021    48
##  6    86   108    18   997    44
##  7    87   108    22  1033    58
##  8    88   107    21   966    42
##  9    89    99    17   986    44
## 10    90   106    22  1009    27
## 11    91    78    14   961    55
## 12    92    93    20   996    55
## 13    93   118    24  1045    44
## 14    94    93    18   988    55
## 15    95    81    16  1023    46
## 16    96   108    22  1016    49
## 17    97    98    29   983    43
## 18    98   105    12   993    46
## 19    99    93    17   990    55
## 20   100    89    15   959    51

Option 3: View entire dataset in a separate window

View(bat_counts)

Here’s a quick checklist about what you should look at whenever you get a new dataset:

  • What columns are in the dataset?
  • What type of data are in each column (is it a number? a true/false value? a date? a set of character values?)
  • What range of values are in each column?
  • How are the data formatted?
str(bat_counts)
## tibble [100 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ site: num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
##  $ mylu: num [1:100] 84 114 87 102 111 93 134 102 92 100 ...
##  $ myse: num [1:100] 19 16 22 16 13 22 21 18 11 30 ...
##  $ myso: num [1:100] 984 963 989 981 1017 ...
##  $ pesu: num [1:100] 37 54 64 47 53 32 39 32 54 53 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   site = col_double(),
##   ..   mylu = col_double(),
##   ..   myse = col_double(),
##   ..   myso = col_double(),
##   ..   pesu = col_double()
##   .. )

It is probably not immediately obvious what these data contain but the file name gives it away. These are counts of four different species of hibernating bat that have all been highly impacted by white-nose syndrome:

Myotis lucifugus (little brown bat) = mylu

Myotis septentrionalis (northern long-eared bat) = myse

Perimyotis subflavus (tri-colored bat) = pesu

Myotis sodalis (Indiana bat) = myso

4.2 Transforming data

When I get count data from state managers it is often in this format, which we call the “wide” data format. It is wide because instead of putting counts all in one single column, they are spread out among four columns and the column names are the name of the species. However, to plot data in R and to conduct any analyses, we need the data to be in what we call “long” format. Luckily in R, you can easily switch between wide and long format:

If switching from wide to long, we use the function gather() which has the following inputs:

  1. A list of the columns you want to condense down into a single column. In our example, that will be the columns with the species codes: c(mylu, myse, pesu, myso)

  2. The name of the new column that will hold the column names for the data you are gathering. In our example, the column names are the species codes and so we will name that column by specifying key = species.

  3. The name of the new column that will hold the values for the data you are gathering. In our example, the values being gathered are counts of each bat species so we will name that column by specifying value = count.

Here is the code:

bat_counts_long <- bat_counts %>% 
  gather(c(mylu, myse, pesu, myso), key = "species", value = "count")

bat_counts_long
## # A tibble: 400 x 3
##     site species count
##    <dbl> <chr>   <dbl>
##  1     1 mylu       84
##  2     2 mylu      114
##  3     3 mylu       87
##  4     4 mylu      102
##  5     5 mylu      111
##  6     6 mylu       93
##  7     7 mylu      134
##  8     8 mylu      102
##  9     9 mylu       92
## 10    10 mylu      100
## # … with 390 more rows

Now to switch from long to wide, we use the function spread which has the following inputs:

  1. The name of the column that holds names of the new columns after spreading the data. In our example, we will go back to the original wide format of the data where the species codes were the column names. We do that by specifying key = species.

  2. The name of the column that holds the values that will populate the new spread columns. In our example, the bat counts populated the data associated with each species code. We do this by specifying value = count.

bat_counts_long %>% 
  spread(key = species, value = count)
## # A tibble: 100 x 5
##     site  mylu  myse  myso  pesu
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1    84    19   984    37
##  2     2   114    16   963    54
##  3     3    87    22   989    64
##  4     4   102    16   981    47
##  5     5   111    13  1017    53
##  6     6    93    22  1016    32
##  7     7   134    21  1039    39
##  8     8   102    18  1014    32
##  9     9    92    11   946    54
## 10    10   100    30   995    53
## # … with 90 more rows

4.3 Other data munging tools

Add a new column:

bat_counts_long %>% 
  mutate(age = "adult")
## # A tibble: 400 x 4
##     site species count age  
##    <dbl> <chr>   <dbl> <chr>
##  1     1 mylu       84 adult
##  2     2 mylu      114 adult
##  3     3 mylu       87 adult
##  4     4 mylu      102 adult
##  5     5 mylu      111 adult
##  6     6 mylu       93 adult
##  7     7 mylu      134 adult
##  8     8 mylu      102 adult
##  9     9 mylu       92 adult
## 10    10 mylu      100 adult
## # … with 390 more rows

Filter data:

bat_counts_long %>% 
  filter(species == "mylu")
## # A tibble: 100 x 3
##     site species count
##    <dbl> <chr>   <dbl>
##  1     1 mylu       84
##  2     2 mylu      114
##  3     3 mylu       87
##  4     4 mylu      102
##  5     5 mylu      111
##  6     6 mylu       93
##  7     7 mylu      134
##  8     8 mylu      102
##  9     9 mylu       92
## 10    10 mylu      100
## # … with 90 more rows

Select specific columns:

bat_counts_long %>% 
  select(species, count)
## # A tibble: 400 x 2
##    species count
##    <chr>   <dbl>
##  1 mylu       84
##  2 mylu      114
##  3 mylu       87
##  4 mylu      102
##  5 mylu      111
##  6 mylu       93
##  7 mylu      134
##  8 mylu      102
##  9 mylu       92
## 10 mylu      100
## # … with 390 more rows

Summarize the data:

bat_counts_long %>% 
  group_by(species) %>% 
  summarise(mean_count = mean(count),
            sd_count = sd(count))
## # A tibble: 4 x 3
##   species mean_count sd_count
##   <chr>        <dbl>    <dbl>
## 1 mylu         101.     10.2 
## 2 myse          19.9     4.19
## 3 myso         999.     26.5 
## 4 pesu          48.8    10.0

For more on data munging, see chapter 5 in the R for Data Science book: https://r4ds.had.co.nz/transform.html

4.4 Joining datasets

I’ll provide a short, quick example of how we can join datasets, but there’s a number of ways to join datasets and this goes beyond the scope of this intro session. You can check out more here: https://r4ds.had.co.nz/relational-data.html

Let’s imagine we have temperature data for each cave or as it is called in our dataset, “site,” and we want to merge this with our count data.

First read in the bat_sites.csv data:

bat_sites <- read_csv("data/bat_sites.csv")

Take a quick look at this dataset. What is in it?

bat_sites
## # A tibble: 100 x 2
##     site temperature
##    <dbl>       <dbl>
##  1     1        7.92
##  2     2        5.86
##  3     3        6.42
##  4     4        7.65
##  5     5       10.9 
##  6     6        5.88
##  7     7       10.3 
##  8     8       11.9 
##  9     9        8.81
## 10    10       12.3 
## # … with 90 more rows

Now let’s merge it with our bat counts:

bat_together <- full_join(bat_counts, bat_sites)

bat_together
## # A tibble: 100 x 6
##     site  mylu  myse  myso  pesu temperature
##    <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
##  1     1    84    19   984    37        7.92
##  2     2   114    16   963    54        5.86
##  3     3    87    22   989    64        6.42
##  4     4   102    16   981    47        7.65
##  5     5   111    13  1017    53       10.9 
##  6     6    93    22  1016    32        5.88
##  7     7   134    21  1039    39       10.3 
##  8     8   102    18  1014    32       11.9 
##  9     9    92    11   946    54        8.81
## 10    10   100    30   995    53       12.3 
## # … with 90 more rows

You can see that using the function full_join we were able to join our bat counts together with our site data. How did R know how to join these two datasets together? It recognized that there was a common column name in both these datasets - “site” - and used that column to join the count data with the temperature data for each site. You can also specify what column you want R to join the datasets:

bat_together <- full_join(bat_counts, bat_sites, by = c("site"))

And if you happened to name the column different names in each dataset, don’t worry, you can specify that too. I won’t go over it now, but if there is opportunity we can go over it in a future follow-up R session. You can also check it out in that link above.

5. Plotting

Okay, now let’s do something fun. Let’s make a plot!!!

You can start a new R script if you want. If you do, make sure to read in tidyverse at the top of your code:

library(tidyverse)

First we need some data. R has a number of datasets already loaded that you can access. You can find a description of them here: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

A really popular dataset is called “iris” and contains measurements on flower sepal length, sepal width, petal length, and petal width for several species of flower.

Let’s take a quick look at these data using one of the functions to explore the data that I introduced above. What can you tell me about the iris dataset?

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Scatter plot

Let’s say we want to look at how sepal length is related to sepal width

iris %>% 
  ggplot(aes(x = Sepal.Length, Sepal.Width)) +
  geom_point()

Your plot should pop up in the Plots window. You can press Zoom to get a better view of the plot. Or click Export and save it.

We can make this figure more complex:

iris %>% 
  ggplot(aes(x = Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

We can add a fitted line if we think there is a linear relationship between sepal width and sepal length:

iris %>% 
  ggplot(aes(x = Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")

We can also make this figure more beautiful:

theme_set( #theme_set will make this setting for all plots within this session
  theme_bw() + #this will make the background black and white instead of gray
  theme(panel.grid = element_blank()) #this will remove the grid lines 
  )

iris %>% 
  ggplot(aes(x = Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")

We can also label the x and y axes:

iris %>% 
  ggplot(aes(x = Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("Sepal Length (cm)") +
  ylab("Sepal Width (cm)")

We can also make the points bigger and adjust the shape or shading:

iris %>% 
  ggplot(aes(x = Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth(method = "lm") +
   geom_smooth(method = "lm") +
  xlab("Sepal Length (cm)") +
  ylab("Sepal Width (cm)")

iris %>% 
  ggplot(aes(x = Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 2, shape = 1) +
  geom_smooth(method = "lm") +
   geom_smooth(method = "lm") +
  xlab("Sepal Length (cm)") +
  ylab("Sepal Width (cm)")

Instead of noting species by color, we could can change the shape and lineytpe by species (good if you need to print the figure in black and white):

iris %>% 
  ggplot(aes(x = Sepal.Length, Sepal.Width, shape = Species, linetype = Species)) +
  geom_point() +
  geom_smooth(method = "lm") +
   geom_smooth(method = "lm") +
  xlab("Sepal Length (cm)") +
  ylab("Sepal Width (cm)")

Bar plots

Making a bar plot is not immediately intuitive in R. First, you should know that bar plots give you the ability to plot the summary of your data unless you specify to plot the data as is. Let me show by example…

Let’s say you want to plot the sample size by species (in other words, how many individuals of each species was measured in this dataset?):

iris %>% 
  ggplot(aes(x = Species)) +
  geom_bar()

The figure isn’t very exciting, but it’s important to understand what’s happening behind the scenes. Take a minute to interpret this figure. What is it telling us? And what exactly happened when we asked R to plot a bar graph with “species” on the x-axis?

The figure above tells us that there were 50 individuals for each of the 3 species in our iris dataset. When we called the ggplot + geom_bar code, what R did was it counted how many entries our data had for each Species, or in other words, how many times did setosa, versicolor, and virginica appear in our dataset. It did the math for us, and then plotted it! That’s pretty cool.

What if we want to make a bar plot with something like mean sepal length by species?

First we need to do the math:

iris %>% 
  group_by(Species) %>% 
  summarise(mean_sepal_length = mean(Sepal.Length))
## # A tibble: 3 x 2
##   Species    mean_sepal_length
##   <fct>                  <dbl>
## 1 setosa                  5.01
## 2 versicolor              5.94
## 3 virginica               6.59

Then we plot it:

iris %>% 
  group_by(Species) %>% 
  summarise(mean_sepal_length = mean(Sepal.Length)) %>% 
  ggplot(aes(x = Species, y = mean_sepal_length)) +
  geom_bar(stat = "identity") 

Notice that now we have to specify a y-axis and specify inside our geom_bar call that the statistic (“stat”) to use is the actual number in that column (“identity”).

Notice also that I first had to summarize the data iris to produce a mean value for sepal length (mean_sepal_length) and then I piped it into the plotting function. I could also have done this:

iris_mean_sl <- iris %>% 
  group_by(Species) %>% 
  summarise(mean_sepal_length = mean(Sepal.Length)) 

iris_mean_sl %>% 
  ggplot(aes(x = Species, y = mean_sepal_length)) +
  geom_bar(stat = "identity") 

How do the two variations of the code differ and what is the advantage to doing one over the other?

We could leave this lesson on bar plots here, but that would be unsafitisfying beccause we are missing a CRITICAL component to the figure above. Can anyone guess what that is???

One should NEVER show a mean value with showing the variability around that mean. Write it down, tattoo on your arm, use it as your daily mantra!

Here is our complete figure with mean and standard error bars. I have also added a couple stylistic elements - I changed the alpha (shdaing) on the fill of the bars to 0.5, which makes it look nicer, and I adjusted the width of the upper and lower bars on the errorbar to 0.1. You can play with both these values to see how it changes the aesthetics of your figure:

iris %>% 
  group_by(Species) %>% 
  summarise(mean_sepal_length = mean(Sepal.Length),
            sd_sepal_length = sd(Sepal.Length), 
            n = n(),
            se_sepal_length = sd_sepal_length/sqrt(n)) %>% 
  ggplot() +
  geom_bar(aes(x = Species, y = mean_sepal_length), 
           stat = "identity",
           alpha = 0.5) + 
  geom_errorbar(aes(x = Species, 
                    ymin = mean_sepal_length - se_sepal_length, 
                    ymax = mean_sepal_length + se_sepal_length),
                width = 0.1) + #width adjusts the width of the upper and lower bars of the errorbar
  ylab("Mean Sepal Length (cm)")

6. Mapping!

There were some people interested in mapping. Mapping is becoming more accessible in R. Many people still use ArcGIS and while it can do a lot, R is quickly catching up AND one of the benefits to using R is that it is open source software, meaning you don’t have to pay for it and that anyone can write a package to add to mapping/spatial functionality in R, so it is constantly evolving new capabilities!

You can start a new R script and then make sure to specify what libraries you’ll be using at the top of our code:

library(tidyverse)
library(sf)
library(leaflet)

Now look for and read in the file “ddcsp_scholars_2020.csv”:

Let’s take a quick look at these data:

scholar_locations
## # A tibble: 20 x 4
##    name                   university                          latitude longitude
##    <chr>                  <chr>                                  <dbl>     <dbl>
##  1 Isaac Aguilar          UC Berkeley                             37.9    -122. 
##  2 Cristobal Castaneda    UC Davis                                38.5    -122. 
##  3 Jean Nikolas Clemente  Bowdoin College                         43.9     -70.0
##  4 Jennifer Delgado       Carleton College                        44.5     -93.2
##  5 Michael Angelo Fernan… University of Guam                      13.4     145. 
##  6 Kathryn Hart           Indiana University                      39.2     -86.5
##  7 Kian Kafaie            Brown University                        41.8     -71.4
##  8 Tahiry Langrand        Arizona State University                32.9    -112. 
##  9 Adriana Maciel Metal   Yale University                         41.3     -72.9
## 10 Shantae McIntyre       University of Connecticut               41.8     -72.3
## 11 Amina Muumin           University of Minnesota                 45.0     -93.2
## 12 Kellie Navarro         Bowdoin College                         43.9     -70.0
## 13 Genesis Perez          Inter American University of Puert…     18.2     -66.3
## 14 Marion Powell          Willamette University                   44.9    -123. 
## 15 Amy Quintanilla        Wellesly College                        42.3     -71.3
## 16 Sheila Marie Ringor    University of Hawa'I at Manoa           21.3    -158. 
## 17 Isabella Rosado        Yale University                         41.3     -72.9
## 18 Fana Scott             Eckerd College                          27.7     -82.7
## 19 Joshua Stewart         University of Miami                     25.7     -80.3
## 20 Valerie Tafoya         DePaul University                       41.9     -87.7

Let’s map our locations! There are several ways to do this in R, but this one is my favorite.

First we will use functions from package sf to turn our dataset into a spatial features dataframe:

scholar_locations_sf <- st_as_sf(scholar_locations, coords = c("longitude", "latitude"), crs = 4326)

Then we map it, and check it out, you can move the map and zoom in and out!

scholar_locations_sf %>% 
  leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers()

We can refine this map to make it prettier by setting the view (setView) to center the map and to change the underlying map layer:

scholar_locations_sf %>% 
  leaflet() %>% 
  setView(lng = -100, lat = 40, zoom = 3) %>% 
  addProviderTiles(providers$Stamen.Watercolor) %>% 
  addCircleMarkers()

You can take a look at all the different types of basemaps available and switch them out with providers$Stamen.Watercolor: https://leaflet-extras.github.io/leaflet-providers/preview/

We could also label the circles to see who is where:

scholar_locations_sf %>% 
  leaflet() %>% 
  setView(lng = -100, lat = 40, zoom = 3) %>% 
  addProviderTiles(providers$Stamen.Watercolor) %>% 
  addCircleMarkers(label = ~name)

We could look at who is where and what university they are at:

scholar_locations_sf %>% 
  leaflet() %>% 
  setView(lng = -100, lat = 40, zoom = 3) %>% 
  addProviderTiles(providers$Stamen.Watercolor) %>% 
  addCircleMarkers(label = ~paste(name, university, sep = "; "))

We have a problem with a number of scholars that are going to the same school. This part is pretty fancy but actually not that difficult to implement. We can cluster points with two more lines of code:

scholar_locations_sf %>% 
  leaflet() %>% 
  setView(lng = -100, lat = 40, zoom = 3) %>% 
  addProviderTiles(providers$Stamen.Watercolor) %>% 
  addCircleMarkers(label = ~paste(name, university, sep = "; "),
                   group = ~university, #this tells leaflet what column to cluster by
                   clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = F)) #this clusters the points

Where is Abe? Let’s add Abe to the map. We can do this without including him in the scholar_locations data:

scholar_locations_sf %>% 
  leaflet() %>% 
  setView(lng = -100, lat = 40, zoom = 3) %>% 
  addProviderTiles(providers$Stamen.Watercolor) %>% 
  addCircleMarkers(lng = -121.9692444, lat = 36.9747237, label = "Abe!") %>% 
  addCircleMarkers(label = ~paste(name, university, sep = "; "),
                   group = ~university, #this tells leaflet what column to cluster by
                   clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = F)) #this clusters the points

7. Help!

There is nothing more frustrating than getting an error message in R! How to decode this cryptic message? Here are a few pointers:

  1. Type ?function_name in the console and press enter. For example, if I want to know more about the function mean, I would type, ?mean in the console and it will bring up the documentation page in the Help file.
?mean
  1. Some functions come with examples. You can access this by typing example(function_name) in the console. For example, to find examples for the function mean, I would type:
example(mean)
## 
## mean> x <- c(0:10, 50)
## 
## mean> xm <- mean(x)
## 
## mean> c(xm, mean(x, trim = 0.10))
## [1] 8.75 5.50
  1. Google it. Copy-paste the error code and google it. You’ll likely come up with some threads on stack overflow with a similar problem to what you came up with and some answers. Sometimes they help, sometimes they don’t. It just depends.

  2. Ask for help. Throughout this course (and beyond) I am always down to help sleuth R errors. Whether you send me a question or decide to post it online, make sure you provide a reproducible example. A reproducible example includes: required packages, data (or example data), and code.

  3. Sleuth it yourself. You’d be surprised how many error messages you can sleuth on your own, but like a language, it takes time to really understand what’s going on behind the scenes with R. To sleuth code, start by examining all the parts to your code. If it’s just a single function, what inputs did you include? Did any of the inputs break requirements of the function (e.g. you tried to calculate the mean of a column that didn’t contain any numbers)? If you’re running code that pipes through a number of functions, run your code line by line until you get the error message. Oftentimes the error message will give you a hint about what is wrong. It takes some time to get to understand the language of this, but over time, you will get it!

And when in doubt, ask for help! Reach me at .

8. General R conventions

  1. List all the libraries you will need to run your code at the top of your code.

  2. Follow the tidyverse style guide (https://style.tidyverse.org/index.html)

Naming: - short but informative - use lowercase as much as you can - never use a space or period (.) to separate words; you can use "_" (tree_data) or camelcase (treeData) - whatever you use, be consistent and don’t mix naming conventions (don’t do: tree_Data)

Coding: - use spaces between words/commands and enter new lines to keep your code clean - make comments using # to help you remember what you did

  1. Never copy-paste code. Write it out so that you can learn it. Try to understand all the parts of the function.

9. Reinforce your understanding

1. Explore this page with built-in datasets in R: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

2. Pick a dataset and read it into R

3. What can you tell me about this dataset? Use the commands from the first part of section 4 (plotting) to take a quick peak at the data and describe what is in it.

4. Try out some data munging Try switching your data from wide to long or long to wide. Add a new column. Filter for a variable. Select certain columns. Summarise the data.

5. Make a simple plot! Identify what variables you are interested in looking at. What’s your independent variable (on the x-axis) and what is your dependent or response variable (on the y-axis)?

6. Introduce a third variable. If you want to look at how x & y vary by another variable (in our plotting example, that was species), how could you visualize that? What type of aesthetic (color of the point or fill of the bar, shape of the point, etc.) would you change?

7. Make a map. Grab some coordinates from google map (you can find the lat/long in the address bar) and make a map of your choosing!